Data Visualization and Processing


By: Kris Ghimire, Thad Schwebke, Walter Lai, and Jamie Vo Crime Photo Cred.: Photo by kat wilcox from Pexels

In [130]:
# Load in libraries

# general libraries
import pandas as pd
import numpy as np
import os

# hide warnings
import warnings
warnings.filterwarnings('ignore')

# visualizations libraries
import seaborn as sns
import plotly 
import matplotlib.pyplot as plt
import plotly.express as px
import plotly.graph_objects as go
from geopy.geocoders import Nominatim
%matplotlib inline

# Machine Learning 
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.decomposition import PCA
from sklearn.utils import resample
from sklearn.feature_selection import RFE
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import OrdinalEncoder
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2
import statsmodels.api as sm

import random
In [ ]:
from IPython.display import HTML

HTML('''<script>
code_show=true; 
function code_toggle() {
 if (code_show){
 $('div.input').hide();
 } else {
 $('div.input').show();
 }
 code_show = !code_show
} 
$( document ).ready(code_toggle);
</script>
<form action="javascript:code_toggle()"><input type="submit" value="Click here to toggle on/off the raw code."></form>''')

Business Understanding (10 pts)


Purpose of the dataset.

Homocide Data

(i.e., why was this data collected in the first place?).

The Murder Accountability Project is a nonprofit organization that discovers discrepancies between the reported homicides between medical examiners and the FBI voluntary crime report. The database is considered to be one of the most exhaustive record collection of homicides that is currently avaiable for the US. Additional information about the organization can be found at Murder Accountability Project.

The dataset dates back to 1967 and includes demographic information such as gender, age, and ethnicity. A more in depth description of the attributes may be found in the Data Description section.

In [2]:
# read in the data
df = pd.read_csv('../Data/database.csv')
In [3]:
# print the number of records and columns
records = len(df)
attributes = df.columns

print(f'No. of Records: {records} \nNo. of Attributes: {len(attributes)}')
No. of Records: 638454 
No. of Attributes: 24

Define and measure the dataset outcomes.

Describe how you would define and measure the outcomes from the dataset. That is, why is this data important and how do you know if you have mined useful knowledge from the dataset?

In [ ]:
 

Model Statistics

How would you measure the effectiveness of a good prediction algorithm? Be specific.

  • Classification
    • accuracy measures, applying a confusion matrix to return accuracy and precision measures
    • sensitivity and specificity, reducing the bias of the model
    • cross validation (CV PRESS) to determine model's prediction capabilities
    • model comparison (AIC, BIC, etc...)
  • Regression
    • R^2 to determine the fit of line
    • MSE to reduce deviations and increase accuracy
    • ASE to accurately determine the precision and accuracy of the model
In [ ]:
 

Data Understanding (80 pts total)


[10 points] Data Description:

Describe the meaning and type of data (scale, values, etc.) for each attribute in the data file.

In [4]:
df_description = pd.read_excel('../Data/data_description.xlsx')
pd.set_option('display.max_colwidth', 0)
df_description
Out[4]:
Unnamed: 0 Attributes Description
0 0 Record ID NaN
1 1 Agency Code NaN
2 2 Agency Name NaN
3 3 Agency Type NaN
4 4 City NaN
5 5 State NaN
6 6 Year NaN
7 7 Month NaN
8 8 Incident NaN
9 9 Crime Type NaN
10 10 Crime Solved NaN
11 11 Victim Sex NaN
12 12 Victim Age NaN
13 13 Victim Race NaN
14 14 Victim Ethnicity NaN
15 15 Perpetrator Sex NaN
16 16 Perpetrator Age NaN
17 17 Perpetrator Race NaN
18 18 Perpetrator Ethnicity NaN
19 19 Relationship NaN
20 20 Weapon NaN
21 21 Victim Count NaN
22 22 Perpetrator Count NaN
23 23 Record Source NaN

[15 points] Verify data quality:

Explain any missing values, duplicate data, and outliers. Are those mistakes? How do you deal with these problems? Be specific.


Missing Values

  • all missing values were precoded to be 'Unknown' in the dataset

Duplicates

  • there are no duplicates in the dataset, shown belowin the duplicates dataframe

Outliers

In [5]:
df_duplicates = df.groupby(df.columns.tolist(),as_index=False).size()
df_duplicates.loc[df_duplicates['size'] > 1]
Out[5]:
Record ID Agency Code Agency Name Agency Type City State Year Month Incident Crime Type ... Perpetrator Sex Perpetrator Age Perpetrator Race Perpetrator Ethnicity Relationship Weapon Victim Count Perpetrator Count Record Source size

0 rows × 25 columns

[10 points] Statistics:

Give simple, appropriate statistics (range, mode, mean, median, variance, counts, etc.) for the most important attributes and describe what they mean or if you found something interesting. Note: You can also use data from other sources for comparison. Explain the significance of the statistics run and why they are meaningful.

In [6]:
# basic statistics of categorical data
df_categorical = df.select_dtypes(include='object')
df_categorical.describe()
Out[6]:
Agency Code Agency Name Agency Type City State Month Crime Type Crime Solved Victim Sex Victim Race Victim Ethnicity Perpetrator Sex Perpetrator Age Perpetrator Race Perpetrator Ethnicity Relationship Weapon Record Source
count 638454 638454 638454 638454 638454 638454 638454 638454 638454 638454 638454 638454 638454 638454 638454 638454 638454 638454
unique 12003 9216 7 1782 51 12 2 2 3 5 3 3 191 5 3 28 16 2
top NY03030 New York Municipal Police Los Angeles California July Murder or Manslaughter Yes Male White Unknown Male 0 White Unknown Unknown Handgun FBI
freq 38416 38416 493026 44511 99783 58696 629338 448172 494125 317422 368303 399541 211079 218243 446410 273013 317484 616647
In [7]:
# get all levels per categorical attribute
df_categorical_levels = pd.DataFrame()
df_categorical_levels['Attribute'] = df_categorical.columns
df_categorical_levels['Levels'] = ''
df_categorical_levels['Levels_Count'] = ''
df_categorical_levels['Unknown_Count'] = ''

# populate the dataframe with categorical levels and count of each category
for i, row in df_categorical_levels.iterrows():
    attribute = row['Attribute']
    df_categorical_levels.at[i,'Levels'] = df[attribute].unique()
    df_categorical_levels.at[i,'Levels_Count'] = len(df[attribute].unique())
    try:
        df_categorical_levels.at[i,'Unknown_Count'] = df.groupby(attribute).count().loc['Unknown'][0]
    except: 
        df_categorical_levels.at[i,'Unknown_Count'] = 0
In [8]:
# show the dataframe
df_categorical_levels.sort_values(by='Unknown_Count', ascending = False)
Out[8]:
Attribute Levels Levels_Count Unknown_Count
14 Perpetrator Ethnicity [Unknown, Not Hispanic, Hispanic] 3 446410
10 Victim Ethnicity [Unknown, Not Hispanic, Hispanic] 3 368303
15 Relationship [Acquaintance, Unknown, Wife, Stranger, Girlfriend, Ex-Husband, Brother, Stepdaughter, Husband, Sister, Friend, Family, Neighbor, Father, In-Law, Son, Ex-Wife, Boyfriend, Mother, Common-Law Husband, Common-Law Wife, Stepfather, Stepson, Stepmother, Daughter, Boyfriend/Girlfriend, Employer, Employee] 28 273013
13 Perpetrator Race [Native American/Alaska Native, White, Unknown, Black, Asian/Pacific Islander] 5 196047
11 Perpetrator Sex [Male, Unknown, Female] 3 190365
16 Weapon [Blunt Object, Strangulation, Unknown, Rifle, Knife, Firearm, Shotgun, Fall, Handgun, Drowning, Suffocation, Explosives, Fire, Drugs, Gun, Poison] 16 33192
9 Victim Race [Native American/Alaska Native, White, Black, Unknown, Asian/Pacific Islander] 5 6676
8 Victim Sex [Male, Female, Unknown] 3 984
1 Agency Name [Anchorage, Juneau, Nome, Bethel, North Slope Borough, Kenai, Alaska State Police, Jefferson, Bessemer, Birmingham, Fairfield, Gardendale, Leeds, Homewood, Brighton, Hueytown, Warrior, Mobile, Prichard, Chickasaw, Saraland, Satsuma, Montgomery, Autauga, Baldwin, Robertsdale, Daphne, Barbour, Blount, Bullock, Butler, Greenville, Calhoun, Anniston, Oxford, Chambers, Lafayette, Chilton, Choctaw, Clarke, Thomasville, Cleburne, Enterprise, Colbert, Tuscumbia, Conecuh, Coosa, Rockford, Andalusia, Crenshaw, Cullman, Dale, Ozark, Dallas, Selma, Collinsville, Elmore, Tallassee, Atmore, Brewton, Attalla, Gadsden, Fayette, Franklin, Red Bay, Geneva, Greene, Houston, Dothan, Jackson, Scottsboro, Lamar, Lauderdale, Florence, Lawrence, Lee, Auburn, Opelika, Limestone, Athens, Lowndes, Tuskegee, Madison, Huntsville, New Hope, Owens Crossroads, Marengo, Hackleburg, Marshall, Albertville, Morgan, Decatur, Hartselle, Perry, Marion, Pickens, Troy, Russell, Phenix City, St. Clair, ...] 9216 47
12 Perpetrator Age [15, 42, 0, 36, 27, 35, 40, 49, 39, 29, 19, 23, 33, 26, 41, 28, 61, 25, 7, 17, 34, 21, 43, 38, 66, 32, 37, 22, 30, 24, 65, 51, 60, 45, 64, 18, 20, 44, 73, 62, 69, 52, 16, 90, 56, 47, 57, 31, 46, 14, 55, 50, 54, 68, 77, 53, 67, 48, 76, 58, 71, 63, 89, 13, 75, 72, 93, 59, 10, 11, 79, 74, 99, 70, 78, 80, 9, 12, 81, 6, 87, 82, 83, 8, 84, 85, 86, 5, 3, 1, 88, 95, 4, 91, 92, 96, 94, 2, 98, 97, ...] 191 0
0 Agency Code [AK00101, AK00103, AK00106, AK00113, AK00118, AK00123, AKASP00, AL00100, AL00101, AL00102, AL00104, AL00105, AL00106, AL00110, AL00118, AL00119, AL00123, AL00200, AL00201, AL00202, AL00203, AL00205, AL00207, AL00300, AL00301, AL00400, AL00500, AL00505, AL00508, AL00600, AL00800, AL00900, AL01000, AL01001, AL01100, AL01101, AL01105, AL01200, AL01201, AL01400, AL01500, AL01600, AL01602, AL01800, AL01902, AL02000, AL02002, AL02100, AL02200, AL02202, AL02301, AL02400, AL02500, AL02501, AL02600, AL02601, AL02700, AL02701, AL02802, AL02900, AL02902, AL03001, AL03002, AL03101, AL03102, AL03200, AL03201, AL03300, AL03303, AL03400, AL03500, AL03800, AL03801, AL03900, AL03901, AL04000, AL04100, AL04101, AL04200, AL04300, AL04301, AL04302, AL04400, AL04401, AL04500, AL04601, AL04700, AL04701, AL04704, AL04705, AL04800, AL04904, AL05000, AL05001, AL05200, AL05201, AL05202, AL05300, AL05301, AL05400, ...] 12003 0
7 Crime Solved [Yes, No] 2 0
6 Crime Type [Murder or Manslaughter, Manslaughter by Negligence] 2 0
5 Month [January, March, April, May, June, July, August, December, November, February, October, September] 12 0
4 State [Alaska, Alabama, Arkansas, Arizona, California, Colorado, Connecticut, District of Columbia, Delaware, Florida, Georgia, Hawaii, Iowa, Idaho, Illinois, Indiana, Kansas, Kentucky, Louisiana, Massachusetts, Maryland, Maine, Michigan, Minnesota, Missouri, Mississippi, Montana, Nebraska, North Carolina, North Dakota, New Hampshire, New Jersey, New Mexico, Nevada, New York, Ohio, Oklahoma, Oregon, Pennsylvania, Rhodes Island, South Carolina, South Dakota, Tennessee, Texas, Utah, Virginia, Vermont, Washington, Wisconsin, West Virginia, Wyoming] 51 0
3 City [Anchorage, Juneau, Nome, Bethel, North Slope, Kenai Peninsula, Jefferson, Mobile, Montgomery, Autauga, Baldwin, Barbour, Blount, Bullock, Butler, Calhoun, Chambers, Chilton, Choctaw, Clarke, Cleburne, Coffee, Colbert, Conecuh, Coosa, Covington, Crenshaw, Cullman, Dale, Dallas, De Kalb, Elmore, Escambia, Etowah, Fayette, Franklin, Geneva, Greene, Houston, Jackson, Lamar, Lauderdale, Lawrence, Lee, Limestone, Lowndes, Macon, Madison, Marengo, Marion, Marshall, Morgan, Perry, Pickens, Pike, Russell, St. Clair, Shelby, Sumter, Talladega, Tallapoosa, Tuscaloosa, Walker, Washington, Wilcox, Winston, Arkansas, Ashley, Benton, Boone, Bradley, Chicot, Clark, Clay, Columbia, Conway, Craighead, Crawford, Crittenden, Drew, Faulkner, Fulton, Garland, Grant, Hempstead, Hot Spring, Lafayette, Lincoln, Little River, Lonoke, Miller, Mississippi, Newton, Ouachita, Phillips, Poinsett, Polk, Prairie, Pulaski, St. Francis, ...] 1782 0
2 Agency Type [Municipal Police, County Police, State Police, Sheriff, Special Police, Regional Police, Tribal Police] 7 0
17 Record Source [FBI, FOIA] 2 0

Attributes with the greatest amount of missing data are ethnicity, relationship, and perpetrator race/sex.

In [9]:
# basic statistics for continuous variables
df.describe()
Out[9]:
Record ID Year Incident Victim Age Victim Count Perpetrator Count
count 638454.00000 638454.000000 638454.000000 638454.000000 638454.000000 638454.000000
mean 319227.50000 1995.801102 22.967924 35.033512 0.123334 0.185224
std 184305.93872 9.927693 92.149821 41.628306 0.537733 0.585496
min 1.00000 1980.000000 0.000000 0.000000 0.000000 0.000000
25% 159614.25000 1987.000000 1.000000 22.000000 0.000000 0.000000
50% 319227.50000 1995.000000 2.000000 30.000000 0.000000 0.000000
75% 478840.75000 2004.000000 10.000000 42.000000 0.000000 0.000000
max 638454.00000 2014.000000 999.000000 998.000000 10.000000 10.000000

[15 points] Visualization

Visualize the most important attributes appropriately (at least 5 attributes). Important: Provide an interpretation for each chart. Explain for each attribute why the chosen visualization is appropriate.

In [10]:
fig = px.scatter_matrix(df[['Year', 'Incident', 'Victim Age', 'Victim Count','Perpetrator Count']])
fig.show()

[15 points] EDA

Explore relationships between attributes: Look at the attributes via scatter plots, correlation, cross-tabulation, group-wise averages, etc. as appropriate. Explain any interesting relationships.

In [11]:
# https://gist.github.com/rogerallen/1583593
states = {
    'Alabama': 'AL',
    'Alaska': 'AK',
    'American Samoa': 'AS',
    'Arizona': 'AZ',
    'Arkansas': 'AR',
    'California': 'CA',
    'Colorado': 'CO',
    'Connecticut': 'CT',
    'Delaware': 'DE',
    'District of Columbia': 'DC',
    'Florida': 'FL',
    'Georgia': 'GA',
    'Guam': 'GU',
    'Hawaii': 'HI',
    'Idaho': 'ID',
    'Illinois': 'IL',
    'Indiana': 'IN',
    'Iowa': 'IA',
    'Kansas': 'KS',
    'Kentucky': 'KY',
    'Louisiana': 'LA',
    'Maine': 'ME',
    'Maryland': 'MD',
    'Massachusetts': 'MA',
    'Michigan': 'MI',
    'Minnesota': 'MN',
    'Mississippi': 'MS',
    'Missouri': 'MO',
    'Montana': 'MT',
    'Nebraska': 'NE',
    'Nevada': 'NV',
    'New Hampshire': 'NH',
    'New Jersey': 'NJ',
    'New Mexico': 'NM',
    'New York': 'NY',
    'North Carolina': 'NC',
    'North Dakota': 'ND',
    'Northern Mariana Islands':'MP',
    'Ohio': 'OH',
    'Oklahoma': 'OK',
    'Oregon': 'OR',
    'Pennsylvania': 'PA',
    'Puerto Rico': 'PR',
    'Rhodes Island': 'RI',
    'South Carolina': 'SC',
    'South Dakota': 'SD',
    'Tennessee': 'TN',
    'Texas': 'TX',
    'Utah': 'UT',
    'Vermont': 'VT',
    'Virgin Islands': 'VI',
    'Virginia': 'VA',
    'Washington': 'WA',
    'West Virginia': 'WV',
    'Wisconsin': 'WI',
    'Wyoming': 'WY'
}
df_state = df.groupby('State').count().reset_index()

df_state['State_Abb'] = [states[full_state] for full_state in df_state['State']]
In [12]:
# heat map of states 

fig = px.choropleth(locations=df_state['State_Abb'], 
                    locationmode="USA-states", 
                    color=df_state['Record ID'], 
                    color_continuous_scale='portland',
                    scope="usa")
fig.update_layout(
    title_text = 'Homicide Rates per State',
    geo_scope='usa', # limite map scope to USA
)
fig.show()

California, Texas, New York, and Florida are the leading states with homocide rates.

In [13]:
df_homicides_per_year = df.groupby('Year').count().reset_index()
df_homicides_per_year_solved = df.groupby(['Year', 'Crime Solved']).count().reset_index()
In [14]:
# Create traces
solved_y = df_homicides_per_year_solved.loc[df_homicides_per_year_solved['Crime Solved'] == 'Yes']
unsolved_y = df_homicides_per_year_solved.loc[df_homicides_per_year_solved['Crime Solved'] == 'No']
fig = go.Figure()
fig.add_trace(go.Scatter(x=df_homicides_per_year['Year'], y=df_homicides_per_year['Record ID'],
                    mode='lines+markers',
                    name='Homicide Rates'))
fig.add_trace(go.Scatter(x=solved_y['Year'], y=solved_y['Record ID'],
                    mode='lines+markers',
                    name='Solved Homicides'))
fig.add_trace(go.Scatter(x=unsolved_y['Year'], y=unsolved_y['Record ID'],
                    mode='lines+markers',
                    name='Unsolved Homicides'))
fig.update_layout(
    title={
        'text': "Homicides Per Year",
        'y':0.9,
        'x':0.5,
        'xanchor': 'center',
        'yanchor': 'top'},
    xaxis_title="Year",
    yaxis_title=" Number of Homicides")

fig.show()
In [15]:
df_homicides_solved = pd.DataFrame()
df_homicides_solved['Year'] = df_homicides_per_year['Year']
df_homicides_solved['Unsolved'] = unsolved_y['Record ID'].values
df_homicides_solved['Solved'] = solved_y['Record ID'].values
df_homicides_solved['Total Homicides'] = df_homicides_per_year['Record ID']
In [16]:
# new variable creations
df_homicides_solved['Unsolved_Solved_Diff'] = df_homicides_solved['Solved'] - df_homicides_solved['Unsolved']
df_homicides_solved['Diff_Percentage'] = round((df_homicides_solved['Unsolved_Solved_Diff']/df_homicides_solved['Total Homicides'])*100,2)
In [17]:
fig = px.line(df_homicides_solved, x="Year", y="Diff_Percentage", title='Percentage of Difference of Solved vs. Unsolved Homicides')
fig.show()
In [18]:
df_gender = df.groupby(['Perpetrator Sex', 'Victim Sex', 'Year']).count().reset_index()
df_gender['Perp_Vict'] = df['Perpetrator Sex'].str.cat(df['Victim Sex'],sep=" ")
df_gender['Perp_Vict'].unique()
Out[18]:
array(['Male Male', 'Unknown Female', 'Male Female', 'Unknown Male',
       'Female Male', 'Female Female'], dtype=object)
In [19]:
df_hom = pd.DataFrame()
df_hom['Year'] = ''
df_hom['Perp_Vict'] = ''
years = df['Year'].unique()
combo = df_gender['Perp_Vict'].unique()
for val in combo:
    for i in years:
        df_hom = df_hom.append({'Year': i, 'Perp_Vict': val}, ignore_index=True)
In [20]:
df_hom_perp = df_hom.merge(df_gender, on=['Year', 'Perp_Vict'], how='outer')
In [21]:
fig_gender = px.bar(df_hom_perp, x="Perp_Vict", y="Record ID", color="Perp_Vict",
             animation_frame="Year", animation_group="Perp_Vict", range_y=[0,10000],
             opacity = 0.5)

fig_gender.show()

Victims Graphic


In [22]:
df_victime_gender = df.groupby(['Victim Age', 'Victim Sex', 'Year']).count().reset_index()
In [23]:
px.scatter(df_victime_gender, x="Victim Age", y="Record ID", animation_frame="Year", animation_group="Victim Age",
           size="Record ID", color="Victim Sex", hover_name="Record ID",
           log_x=False, size_max=20, range_x=[0,100], range_y=[0,1200])

Majority of Crime is towards men in their teen-20s, dropping off as age increases

Homicide by City


The code below has been commented out due to thelong run time

In [24]:
# create data frame for unique cities in the dataset
# cities = df['City'].unique()
# df_cities = pd.DataFrame()
# df_cities['City'] = cities
# df_cities['Latitude'] = ''
# df_cities['Longitude'] =''
In [ ]:
# get the lat/lon for the cities for mapping
# geolocator = Nominatim(user_agent='myapplication')
# for i, row in df_cities.iterrows():
#     print(i)
#     location = geolocator.geocode(row['City'])
#     try:
#         df_cities.at[i, 'Latitude'] = location.latitude
#         df_cities.at[i, 'Longitude'] = location.longitude
#     except:
#         print(row['City'] + ' missing')
#         pass
        
# df_cities.head()
In [25]:
# due to the long wait time, the df is exported to a csv and reloaded
# df_cities.to_csv('../Data/city_locations.csv')
df_cities = pd.read_csv('../Data/city_locations.csv')
In [26]:
home_per_city = df.groupby(['City']).count().reset_index()
df_cities = home_per_city.merge(df_cities, on = 'City', how='inner')
df_cities.head()
Out[26]:
City Record ID Agency Code Agency Name Agency Type State Year Month Incident Crime Type ... Perpetrator Race_y Perpetrator Ethnicity_y Relationship_y Weapon_y Victim Count_y Perpetrator Count_y Record Source_y Unnamed: 0.1 Latitude Longitude
0 Abbeville 59 59 59 59 59 59 59 59 59 ... 59 59 59 59 59 59 59 885 50.106084 1.833703
1 Acadia 98 98 98 98 98 98 98 98 98 ... 98 98 98 98 98 98 98 464 30.274074 -92.395704
2 Accomack 80 80 80 80 80 80 80 80 80 ... 80 80 80 80 80 80 80 1060 37.742221 -75.674354
3 Ada 211 211 211 211 211 211 211 211 211 ... 211 211 211 211 211 211 211 323 38.839894 -83.505170
4 Adair 130 130 130 130 130 130 130 130 130 ... 130 130 130 130 130 130 130 1292 40.176978 -92.616117

5 rows × 74 columns

In [27]:
fig = px.density_mapbox(df_cities, lat='Latitude', lon='Longitude', z='Record ID', radius=45,
                        center=dict(lat=40, lon=-100), zoom=3,
                        mapbox_style="stamen-terrain")
fig.show()
In [ ]:
 

[10 points] Discoveries

Identify and explain interesting relationships between features and the class you are trying to predict (i.e., relationships with variables and the target classification).

In [ ]:
 

[5 points] New Feature Creation

Are there other features that could be added to the data or created from existing features? Which ones?

In [ ]:
 
Dummy Code
  • Dummy code the categorical data
  • export to csv due to time required for loop to run
In [28]:
# Function to create dummy variables
def dummy_code(col, df): # input the column names and dataframe
    df_dummy = pd.DataFrame()
    for val in col:
        df_dummy_temp = pd.get_dummies(df[val], prefix=val)
        df_dummy = pd.concat([df_dummy, df_dummy_temp], axis=1, sort=False)
    return df_dummy
In [29]:
# select columns for dummy coding
cat_col = df_categorical.columns.values
categorical = np.delete(cat_col, [0,1])
In [30]:
# call function for dummy coding variables
df_dummy = dummy_code(categorical, df)

The cell below has been commented out to prevent rerunning unless necessary due to computing power required.

In [33]:
continuous_col = list(df.describe().columns.values)
df_continuous = df[continuous_col]
df_dummy = df_dummy.merge(df_continuous, on=None, left_index=True, right_index=True, how='outer')
df_dummy = df_dummy.merge(df[['Agency Name', 'Agency Code']], on=None, left_index=True, right_index=True, how='outer')
In [35]:
# export to csv
df_dummy.to_csv('../Data/Dummy_coded_database.csv')

Exceptional Work (10 points total)


• You have free reign to provide additional analyses. • One idea: implement dimensionality reduction, then visualize and interpret the results.

Data Manipulation

In this example, the data set will be used to determine the probability that a crime will be solved Response Variable: Crime Solved

In [ ]:
# scale the data
df_full = pd.read_csv('../Data/Dummy_coded_database.csv')
df_full = df_full.drop('Unnamed: 0', axis=1)
In [37]:
try:
    df_full = df_dummy
else:
    pass
df_full.head()
Out[37]:
Agency Type_County Police Agency Type_Municipal Police Agency Type_Regional Police Agency Type_Sheriff Agency Type_Special Police Agency Type_State Police Agency Type_Tribal Police City_Abbeville City_Acadia City_Accomack ... Record Source_FBI Record Source_FOIA Record ID Year Incident Victim Age Victim Count Perpetrator Count Agency Name Agency Code
0 0 1 0 0 0 0 0 0 0 0 ... 1 0 1 1980 1 14 0 0 Anchorage AK00101
1 0 1 0 0 0 0 0 0 0 0 ... 1 0 2 1980 1 43 0 0 Anchorage AK00101
2 0 1 0 0 0 0 0 0 0 0 ... 1 0 3 1980 2 30 0 0 Anchorage AK00101
3 0 1 0 0 0 0 0 0 0 0 ... 1 0 4 1980 1 43 0 0 Anchorage AK00101
4 0 1 0 0 0 0 0 0 0 0 ... 1 0 5 1980 2 30 0 1 Anchorage AK00101

5 rows × 2123 columns

Train/Test Split

- Train/Test split due to the large data size and for data validation Resource

In [38]:
# set seed
random.seed(1234)
df_clean = df_full.drop(['Agency Name', 'Agency Code'], axis=1)
# split into train/test
y = df_clean['Crime Solved_Yes']
x = df_clean.drop(['Crime Solved_Yes', 'Crime Solved_No'], axis = 1)
In [39]:
x_train,x_test,y_train,y_test=train_test_split(x,y,test_size=0.8)
Balancing the Dataset

The data set is skewed heavily to the yes side, as shown in table below

In [40]:
# check for a balanced dataset
df_crime = df_full[['Crime Solved_Yes', 'Crime Solved_No']].groupby('Crime Solved_Yes').count().reset_index().rename(columns={'Crime Solved_No':'Count'})
df_crime['Solved'] = ['No', 'Yes']
df_crime = df_crime.drop('Crime Solved_Yes', axis=1)
total = df_crime['Count'].sum()
df_crime['Percentage'] = [x/total for x in df_crime['Count']]
df_crime.head()
Out[40]:
Count Solved Percentage
0 190282 No 0.298036
1 448172 Yes 0.701964

Down sampling will be used to balance out the data. Resource

In [41]:
training_df = x_train
training_df['Crime Solved_Yes'] = y_train
In [42]:
# Separate majority and minority classes
df_majority = training_df[training_df['Crime Solved_Yes']==1]
df_minority = training_df[training_df['Crime Solved_Yes']==0]
 
# Downsample majority class
df_majority_downsampled = resample(df_majority, 
                                 replace=False,    # sample without replacement
                                 n_samples=len(df_minority),     # to match minority class
                                 random_state=123) # reproducible results
 
# # Combine minority class with downsampled majority class
df_downsampled = pd.concat([df_majority_downsampled, df_minority])
 
In [43]:
# Display new class counts
df_downsampled.groupby('Crime Solved_Yes').count()['Agency Type_County Police']
Out[43]:
Crime Solved_Yes
0    37963
1    37963
Name: Agency Type_County Police, dtype: int64
In [44]:
df_downsampled.to_csv("../Data/training_data_set.csv") 
# save to csv due to large data file, prevent the need to rerun code

Create a new split based off the balanced training sets


Due to the large dataset size, rather than doing a traditional train split, the method below resulted in quicker computation and prevented compuatational timeout

In [45]:
df_training = pd.read_csv("../Data/training_data_set.csv")
index_train = df_training['Unnamed: 0']
df_training = df_training.drop('Unnamed: 0', axis=1)
In [50]:
# index_train.values
full_ind = df_full.index.values
train_ind = index_train.values
mask = np.isin(full_ind, train_ind, invert=True)
test_ind = full_ind[mask]
print('actual_index: ' + str(len(test_ind))+', test_index: ' + str(len(index_train.values) - 638454))
actual_index: 562528, test_index: -562528
In [51]:
df_test = df_full.iloc[test_ind]
df_test.to_csv("../Data/test_data_set.csv")

Final Train/Test Split


Read in the data frames from saved csv to prevent repetitive computation

In [ ]:
df_training = pd.read_csv("../Data/training_data_set.csv")
df_test = pd.read_csv("../Data/test_data_set.csv")
In [215]:
x_train = df_training.drop('Crime Solved_Yes', axis=1)
x_test =  df_test.drop('Crime Solved_Yes', axis=1)
y_train = df_training['Crime Solved_Yes']
y_test = df_test['Crime Solved_Yes']
PCA Code

  1. PCA is only transformable on continuous data. While it can be run on dummy coded categorical variables, PCA is meant only for continuous attributes, and deviation can result in midleading conclusions.

  2. Since PCA is sensitive to scales, the first step is to scale the data Resource

In [60]:
continuous_col = list(df.describe().columns.values)
pca_x_train = x_train[continuous_col]
pca_x_train.head()
Out[60]:
Record ID Year Incident Victim Age Victim Count Perpetrator Count
0 387185 1998 1 18 1 3
1 36194 1981 9 30 7 0
2 136966 1986 1 49 0 0
3 23688 1981 4 22 0 0
4 202350 1990 1 34 0 0

PCA Model

In [54]:
# Standardizing the features
x = StandardScaler().fit_transform(pca_x_train)
In [56]:
pca = PCA(n_components=6)
principalComponents = pca.fit_transform(x)
principalDf = pd.DataFrame(data = principalComponents, columns=['PCA_'+ str(x) for x in range(6)])
In [57]:
df_PCA = pd.concat([principalDf, y], axis=1)
In [58]:
fig = px.scatter(principalComponents, x=df_PCA['PCA_0'], y=df_PCA['PCA_1'], color=df_PCA['Crime Solved_Yes'],
                width=600, height=300)
fig.update_layout(title='PCA 1 vs. PCA 2',
                  yaxis_zeroline=False, xaxis_zeroline=False)
fig.update_xaxes(title_text='PCA 1')
fig.update_yaxes(title_text='PCA 2')
fig.show()
Feature Selection

The feature selection used in python is Recursive Feature Elimination(RFE), which recursively selects smaller and smaller amount of features until the final recommended features are selected.

First, feature selection on the attributes without dummy coding are tested to narrow down the features to the top 5.Resource Then, the top 5 features are run through RFE to further eliminate unnecessary attributes. Resource

In [119]:
# load the dataset
def load_dataset(filename):
    # load the dataset as a pandas DataFrame
    data = read_csv(filename, header=None)
    # retrieve numpy array
    dataset = data.values
    # split into input (X) and output (y) variables
    X = dataset[:, :-1]
    y = dataset[:,-1]
    # format all fields as string
    X = X.astype(str)
    return X, y

# prepare input data
def prepare_inputs(X_train, X_test, all_test):
    oe = OrdinalEncoder()
    oe.fit(all_test)
    X_train_enc = oe.transform(X_train)
    X_test_enc = oe.transform(X_test)
    return X_train_enc, X_test_enc

# prepare target
def prepare_targets(y_train, y_test):
    le = LabelEncoder()
    le.fit(y_train)
    y_train_enc = le.transform(y_train)
    y_test_enc = le.transform(y_test)
    return y_train_enc, y_test_enc

# feature selection
def select_features(X_train, y_train, X_test):
    fs = SelectKBest(score_func=chi2, k='all')
    fs.fit(X_train, y_train)
    X_train_fs = fs.transform(X_train)
    X_test_fs = fs.transform(X_test)
    return X_train_fs, X_test_fs, fs
In [147]:
df_logit = df.drop(['Record ID', 'Incident'], axis=1)
In [148]:
x_df = df_logit.drop('Crime Solved', axis=1)
y_df = df_logit['Crime Solved']
x_train_m,x_test_m,y_train_m,y_test_m=train_test_split(x_df,y_df,test_size=0.8)
In [149]:
x_train_m['Crime Solved'] = y_train_m
df_majority_m = x_train_m[x_train_m['Crime Solved']=='Yes']
df_minority_m = x_train_m[x_train_m['Crime Solved']=='No']
 
# Downsample majority class
df_majority_downsampled_m = resample(df_majority_m, 
                                 replace=False,    # sample without replacement
                                 n_samples=len(df_minority_m),     # to match minority class
                                 random_state=123) # reproducible results

df_majority_downsampled_m = pd.concat([df_majority_downsampled_m, df_minority_m])
In [150]:
df_test_m = df_logit.merge(df_majority_downsampled_m, how = 'outer' ,indicator=True).loc[lambda x : x['_merge']=='left_only']
In [153]:
x_m = df_majority_downsampled_m.drop('Crime Solved', axis=1).astype(str)
y_m = df_majority_downsampled_m['Crime Solved'].astype(str)
x_test_m = df_test_m.drop(['Crime Solved','_merge'], axis=1).astype(str)
y_test_m = df_test_m['Crime Solved'].astype(str)
x_test_all = df_logit.drop('Crime Solved', axis=1).astype(str)
In [154]:
# prepare input data
X_train_enc, X_test_enc = prepare_inputs(x_m, x_test_m, x_test_all)
# prepare output data
y_train_enc, y_test_enc = prepare_targets(y_m, y_test_m)
# feature selection
X_train_fs, X_test_fs, fs = select_features(X_train_enc, y_train_enc, X_test_enc)
In [157]:
features = pd.DataFrame()
features['Attributes'] = x_m.columns
features['Feature Selection'] = fs.scores_.round(2)
features.sort_values(by='Feature Selection', ascending=False).head(5)
Out[157]:
Attributes Feature Selection
0 Agency Code 1235063.09
13 Perpetrator Age 795057.05
16 Relationship 104898.83
12 Perpetrator Sex 16159.81
19 Perpetrator Count 8102.46
In [159]:
# what are scores for the features
fig = px.bar(features, x='Attributes', y='Feature Selection')
fig.show()

From the graph above, the top five attributes which are the top predictors for whether or not a homocide is solved are

  1. Agency Code
  2. Perpetrator Age
  3. Relationship
  4. Sex
  5. Perpetrator Count

Next, using linear regression and RFE, the attributes will be reduced further.

In [216]:
top_5 = features.sort_values(by='Feature Selection', ascending=False).head(5)['Attributes'].values
filter_col = [x for x in x_train if x.startswith(tuple(top_5))]
x_train2 = x_train[filter_col]
In [217]:
# from sklearn.linear_model import LogisticRegression
logreg = LogisticRegression()
rfe = RFE(logreg, 5)
rfe = rfe.fit(x_train2, y_train.values.ravel())
print(rfe.support_)
print(rfe.ranking_)
[ True  True  True  True False False False False False False False False
 False False False False False False False False False False False False
 False False False False False False False False False False False False
 False False False False False False False False False False False False
 False False False False False False False False False False False False
 False False False False False False False False False False False False
 False False False False False False False False False False False False
 False False False False False False False False False False False False
 False False False False False False False False False False False False
 False False False False False False False False False False False False
 False False False False False False False False False False False False
 False False False False False False False False False False False False
 False False False False False False False False False False False False
 False False False False False False False False False False False False
 False False False False False False False False False False False False
 False False False False False False False False False False False False
 False False False False False False False False False False False False
 False False False False False False False False False False False False
 False False False False  True False False]
[  1   1   1   1  73 197 169 179 176 158 127 156 153 162 102 100  71  58
  40   3  18   8   9  11  17  13  14  12  21  16  19  22  20   5  25  27
  24  30   4  28  31  37  33  36  35  38  41  39  42  48  51  47 113 160
  49  54  52  75  59  62 132  61  72  65  76  66  67  80  78  82  77  84
  92  90  96 108 109 110 115 125 124  99 134 118 147 138 142 144 131 165
 145 177 155 128 170 167 137 207 206 196 209 192  57 123   2 219 218 202
 182 168 194 120 150 163  55 103  83  94  97  88  89 139  79  98  91  81
 190  93 121 105 112 107  95 117 114 126 106 111 119 104 148 122 143 116
 136 149 129 140 141 130 151 152 157 161 133 166 154 217 175 146 174 159
 164 185 181 191 184 178 213 172 189 193 188 186 180 173 210 216 171 208
 215 214 183 187 195 211 198 199 205 200 212 201 203 204  32  15  87  50
  45  63  29  74  85  68  70  44  53  34  43  10  56  46  60  69  23 101
  64 135  86   6   1  26   7]
In [218]:
rfe_features = pd.DataFrame()
rfe_features['Attributes'] = x_train2.columns
rfe_features['Significant'] = rfe.support_
rfe_features.loc[rfe_features['Significant'] == True]
Out[218]:
Attributes Significant
0 Perpetrator Sex_Female True
1 Perpetrator Sex_Male True
2 Perpetrator Sex_Unknown True
3 Perpetrator Age_0 True
220 Relationship_Unknown True

According to RFE, the only attributes which are significant are the perpetrator's sex, if the perpetrator's age is 0 (indicating that their age is unknown), and if the relationship is unknown

In [219]:
rfe_top_5 = rfe_features.loc[rfe_features['Significant'] == True]['Attributes'].values
In [220]:
x_train3 = x_train2[rfe_top_5]
In [221]:
rfe_features.loc[rfe_features['Significant'] == True]['Attributes'].values
Out[221]:
array(['Perpetrator Sex_Female', 'Perpetrator Sex_Male',
       'Perpetrator Sex_Unknown', 'Perpetrator Age_0',
       'Relationship_Unknown'], dtype=object)
Logistic Regression

Resource

In [222]:
logit_model=sm.Logit(y_train,x_train3[['Perpetrator Sex_Male',
       'Perpetrator Sex_Unknown', 'Perpetrator Age_0',
       'Relationship_Unknown']])
result=logit_model.fit()
print(result.summary2())
Optimization terminated successfully.
         Current function value: 0.036163
         Iterations 12
                              Results: Logit
===========================================================================
Model:                   Logit               Pseudo R-squared:    0.948    
Dependent Variable:      Crime Solved_Yes    AIC:                 5499.4563
Date:                    2020-09-07 23:26    BIC:                 5536.4064
No. Observations:        75926               Log-Likelihood:      -2745.7  
Df Model:                3                   LL-Null:             -52628.  
Df Residuals:            75922               LLR p-value:         0.0000   
Converged:               1.0000              Scale:               1.0000   
No. Iterations:          12.0000                                           
---------------------------------------------------------------------------
                         Coef.   Std.Err.    z     P>|z|   [0.025   0.975] 
---------------------------------------------------------------------------
Perpetrator Sex_Male      7.9572   0.3334  23.8667 0.0000   7.3038   8.6107
Perpetrator Sex_Unknown -13.4677   0.5949 -22.6397 0.0000 -14.6336 -12.3017
Perpetrator Age_0         2.1507   0.4307   4.9931 0.0000   1.3065   2.9949
Relationship_Unknown      3.6977   0.3374  10.9595 0.0000   3.0364   4.3590
===========================================================================

In [223]:
x_train4 = x_train3[['Perpetrator Sex_Male',
       'Perpetrator Sex_Unknown', 'Perpetrator Age_0',
       'Relationship_Unknown']]
In [224]:
logreg = LogisticRegression()
logreg.fit(x_train4, y_train)
Out[224]:
LogisticRegression()
In [225]:
x_test2 = x_test[['Perpetrator Sex_Male',
       'Perpetrator Sex_Unknown', 'Perpetrator Age_0',
       'Relationship_Unknown']]
age_0 = x_test2.iloc[:,2]
x_test2 = x_test2.drop('Perpetrator Age_0', axis=1)
x_test2['Perpetrator Age_0'] = age_0
x_test2.head()
Out[225]:
Perpetrator Sex_Male Perpetrator Sex_Unknown Relationship_Unknown Perpetrator Age_0
0 1 0 0 0
1 1 0 0 0
2 0 1 1 1
3 1 0 0 0
4 0 1 1 1
In [226]:
y_pred = logreg.predict(x_test2)
print('Accuracy of logistic regression classifier on test set: {:.2f}'.format(logreg.score(x_test2, y_test)))
Accuracy of logistic regression classifier on test set: 1.00
In [227]:
from sklearn.metrics import confusion_matrix
confusion_matrix = confusion_matrix(y_test, y_pred)
print(confusion_matrix)
[[152264     55]
 [   130 410079]]
In [230]:
from sklearn.metrics import plot_confusion_matrix
plot_confusion_matrix(logreg, x_test2, y_test)  # doctest: +SKIP
plt.show()  # doctest: +SKIP
In [228]:
from sklearn.metrics import classification_report
print(classification_report(y_test, y_pred))
              precision    recall  f1-score   support

           0       1.00      1.00      1.00    152319
           1       1.00      1.00      1.00    410209

    accuracy                           1.00    562528
   macro avg       1.00      1.00      1.00    562528
weighted avg       1.00      1.00      1.00    562528

In [231]:
from sklearn.metrics import roc_auc_score
from sklearn.metrics import roc_curve
logit_roc_auc = roc_auc_score(y_test, logreg.predict(x_test2))
fpr, tpr, thresholds = roc_curve(y_test, logreg.predict_proba(x_test2)[:,1])
plt.figure()
plt.plot(fpr, tpr, label='Logistic Regression (area = %0.2f)' % logit_roc_auc)
plt.plot([0, 1], [0, 1],'r--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver operating characteristic')
plt.legend(loc="lower right")
plt.savefig('Log_ROC')
plt.show()
In [ ]: